[PowerPC] Implement a more efficient memcmp in cases where the length is known. #158657
@llvm/pr-subscribers-backend-powerpc

Author: zhijian lin (diggerlin)
if (Subtarget.hasVSX()) {
  if (LHS.getOpcode() == ISD::LOAD && RHS.getOpcode() == ISD::LOAD &&
      LHS.hasOneUse() && RHS.hasOneUse() &&
      LHS.getValueType() == MVT::i128 && RHS.getValueType() == MVT::i128) {
why the type restriction?
In the expand-memcmp pass,
%bcmp = tail call i32 @bcmp(ptr noundef nonnull dereferenceable(16) %a, ptr noundef nonnull dereferenceable(16) %b, i64 16)
%cmp = icmp eq i32 %bcmp, 0
%conv = zext i1 %cmp to i32
ret i32 %conv
is changed to
%0 = load i128, ptr %a, align 1
%1 = load i128, ptr %b, align 1
%2 = icmp ne i128 %0, %1
%3 = zext i1 %2 to i32
%cmp = icmp eq i32 %3, 0
%conv = zext i1 %cmp to i32
ret i32 %conv
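The expanded IR above can be read as the following C equivalent (a sketch, not code from the PR; the function name is illustrative): load both 16-byte buffers as two possibly-unaligned 64-bit halves and compare them directly, instead of calling bcmp.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of what the expanded IR computes: a branchless 16-byte
 * equality test built from unaligned 64-bit loads. */
int eq16(const unsigned char *a, const unsigned char *b) {
    uint64_t a0, a1, b0, b1;
    memcpy(&a0, a, 8);      /* unaligned load, like "load i128 ... align 1" */
    memcpy(&a1, a + 8, 8);
    memcpy(&b0, b, 8);
    memcpy(&b1, b + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0; /* 1 iff all 16 bytes match */
}
```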
But in the original code, the load i128, ptr %a, align 1 is lowered to
t27: i64,ch = load<(load (s64) from %ir.a, align 1)> t0, t2, undef:i64
t32: i64,ch = load<(load (s64) from %ir.b, align 1)> t0, t4, undef:i64
which is not efficient: it takes two ld instructions in 64-bit mode or four lwz instructions in 32-bit mode. We want the i128 to be converted to a vector load, so there is a type restriction.
auto *LA = dyn_cast<LoadSDNode>(LHS);
auto *LB = dyn_cast<LoadSDNode>(RHS);
if (!LA || !LB)
  return SDValue();
Shouldn't all conditions that are not met for this optimization result in the default behaviour for this function on line 15618 below?
Good catch, I will fix it, thanks. I thought that ISD::SETCC only returns i1, and the function PPCTargetLowering::DAGCombineTruncBoolExt only deals with i32 and i64, so I returned SDValue() here directly. But ISD::SETCC may return i32/i64 too.
SDValue Add = DAG.getNode(ISD::ADD, DL, OpVT, LHS, RHS.getOperand(1));
return DAG.getSetCC(DL, VT, Add, DAG.getConstant(0, DL, OpVT), CC);
}
if (Subtarget.hasVSX()) {
Please add documentation describing the type of optimization being done in this block.
SDValue Ops[] = {IntrID, CRSel, LHSVec, RHSVec};
SDValue PredResult =
    DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32, Ops);
Ops[] is not needed; just inline it into the call?
gentle ping.
The PR description should probably say that the case optimized is not just known length but EQ/NE only, and that it extends the existing handling from 64 bits up to 128 bits.
entry:
  %call = tail call signext i32 @memcmp(ptr %buffer1, ptr %buffer2, i64 65)
  %call = tail call signext i32 @memcmp(ptr %buffer1, ptr %buffer2, i64 165)
Wouldn't 129 be the equivalent?
Yes, I think I can change it to 129.
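The point of the suggestion: 129 = 128 + 1 sits one byte past a 128-byte block, mirroring how 65 = 64 + 1 sits one byte past a 64-byte block, so it exercises the tail handling of the new expansion. A hypothetical test (the function name is illustrative, not from the PR):

```c
#include <string.h>

/* Hypothetical stand-in for the test case: a memcmp whose constant
 * length is one byte past a 128-byte block. */
int cmp129(const void *a, const void *b) {
    return memcmp(a, b, 129);
}
```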
// Optimization: Fold i128 equality/inequality compares of two loads into a
// vectorized compare using vcmpequb.p when VSX is available.
Maybe say something about how this pattern arises from the inline memcmp expansion.
// as v16i8 vectors and use the Altivec/VSX vcmpequb.p instruction to
// perform a full 128-bit equality check in a single vector compare.
if (Subtarget.hasVSX()) {
vcmpequb is an Altivec vector instruction, not VSX.
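For reference, a portable sketch (an illustration, not the PR's code) of the predicate that vcmpequb.p computes: whether every one of the 16 byte lanes of two vectors compares equal.

```c
#include <stdint.h>

/* Scalar sketch of the vcmpequb.p all-lanes-equal predicate:
 * returns 1 iff all 16 byte lanes of a and b are equal. */
int all_bytes_equal(const uint8_t a[16], const uint8_t b[16]) {
    for (int i = 0; i < 16; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```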
// as written to ensure correct program behavior, so we return an empty
// SDValue to indicate no action.
if (LA->isVolatile() || LB->isVolatile())
  return DAGCombineTruncBoolExt(N, DCI);
This return pattern is going to make it hard to further modify the combineSetCC function. I think this code should be outlined to a separate function.
LB->getBasePtr(), LB->getMemOperand());

SDValue IntrID =
    DAG.getTargetConstant(Intrinsic::ppc_altivec_vcmpequb_p, DL,
Can just use getConstant.
; CHECK-NEXT: mfocrf 3, 2
; CHECK-NEXT: rlwinm 3, 3, 25, 31, 31
; CHECK-NEXT: cntlzw 3, 3
; CHECK-NEXT: srwi 3, 3, 5
Extra instruction? I think isolating and flipping the bit can just be rlwinm/xori.
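A minimal sketch of the two alternatives, assuming the goal is the logical negation of a CR bit already isolated into position 0 by the rlwinm: the emitted cntlzw/srwi pair versus the suggested single xori.

```c
#include <stdint.h>

/* cntlzw semantics: count of leading zeros, defined as 32 for x == 0
 * (note plain __builtin_clz(0) is undefined, hence the guard). */
static unsigned clz32(uint32_t x) {
    return x ? (unsigned)__builtin_clz(x) : 32u;
}

/* The sequence in the test: cntlzw then srwi by 5.
 * For bit==0: 32 >> 5 == 1; for bit==1: 31 >> 5 == 0. */
unsigned negate_via_cntlzw(uint32_t bit) {
    return clz32(bit) >> 5;
}

/* The reviewer's suggestion: a single xori flips the isolated bit. */
unsigned negate_via_xori(uint32_t bit) {
    return bit ^ 1u;
}
```

Both compute !bit; the second does it in one instruction after the rlwinm.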
; CHECK-NEXT: xor 4, 4, 5
; CHECK-NEXT: or 3, 3, 4
; CHECK-NEXT: cntlzd 3, 3
; CHECK-NEXT: rldicl 3, 3, 58, 63
Can we not change this sequence? It seems like a side effect and I'm not sure it's faster or slower.
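For context, the cntlzd/rldicl pair in the sequence above is a standard branchless "is this 64-bit value zero?" idiom: the leading-zero count of a 64-bit value is 64 exactly when the value is zero, so shifting the count right by 6 yields 1 when the XOR-ed halves matched and 0 otherwise. A sketch:

```c
#include <stdint.h>

/* cntlzd semantics: leading-zero count, defined as 64 for x == 0
 * (plain __builtin_clzll(0) is undefined, hence the guard). */
static unsigned clz64(uint64_t x) {
    return x ? (unsigned)__builtin_clzll(x) : 64u;
}

/* cntlzd + rldicl: 1 iff x == 0, computed without a branch. */
unsigned is_zero64(uint64_t x) {
    return clz64(x) >> 6;
}
```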
For

int memcmp ( const void * ptr1, const void * ptr2, size_t num );

in cases where the size_t num parameter is known at compile time, we can do a better job of generating code. For example, when we compile this:
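The example code that followed was not captured in this excerpt; a representative snippet (an assumption, not the PR's actual example) would be a memcmp whose length is a compile-time constant, which the backend can then expand inline:

```c
#include <string.h>

/* Hypothetical example: constant-length memcmp used only for
 * equality, the pattern this PR optimizes. */
int buffers_equal(const void *a, const void *b) {
    return memcmp(a, b, 16) == 0;
}
```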